Abstract


This paper presents an investigation of us- 
ing a co-attention based neural network for 
source-dependent essay scoring. We use a co- 
attention mechanism to help the model learn 
the importance of each part of the essay more 
accurately. Also, this paper shows that the co- 
attention based neural network model provides 
reliable score prediction of source-dependent 
responses. We evaluate our model on two 
source-dependent response corpora. Results 
show that our model outperforms the baseline 
on both corpora. We also show that the atten- 
tion of the model is similar to the expert opin- 
ions with examples. 


1 Introduction 


Manually grading students’ essays is labor inten- 
sive. Therefore, many automated essay scoring 
(AES) methods have been developed to support 
grading essays at scale. However, in different 
grading tasks, the information required by an AES 
system is different. For example, if a system needs 
to assign a holistic score to the essay, the sys- 
tem needs to take all information into account. In 
contrast, if a system needs to assign a score for 
one specific aspect of the essay (e.g. use of evi- 
dence), the system needs to ignore some informa- 
tion. Also, if an essay is a source-dependent es- 
say, the system needs to exploit knowledge of the 
source article. 

This paper focuses on source-dependent essay 
assessment. In this task, students read a source 
article before writing the essay, and assessment 
involves recognizing and analyzing references to 
the article in the essay. We propose a new type 
of co-attention based neural network model tai- 
lored to source-dependent grading, then use two 
source-dependent essay corpora to evaluate our 
model. Our first corpus contains the four source- 
dependent essay sets in the Automated Student 
Assessment Prize (ASAP) corpus. The ASAP
grading task is to assign a holistic score to each 
essay. The second corpus uses the Response to 
Text Assessment (RTA) (Correnti et al., 2013) to 
assess students’ analytic writing skills. Instead of 
evaluating holistic writing skills, the RTA was de- 
signed to evaluate students’ writing skills along 
five dimensions: Analysis, Evidence, Organiza- 
tion, Style, and MUGS (Mechanics, Usage, Gram- 
mar, and Spelling). Our grading task for this cor- 
pus is to assign an Evidence score to each essay, 
by evaluating students’ ability to find and use evi- 
dence from a source article to support their claims. 


The main contributions of this paper are as fol- 
lows. First, we introduce a co-attention based neu- 
ral network model that is fully automated and does 
not need any expert effort to encode knowledge of 
a source article. Second, our co-attention based 
neural network model extends prior work by de- 
signing the model to take a source article into ac- 
count during grading. Third, we apply our model 
to the subset of source-dependent response tasks
in the ASAP corpus and show that the model out- 
performs a previous neural network model devel- 
oped for the full corpus. Fourth, we show that 
our model also performs well on the RTA task and 
again significantly outperforms our baseline neu- 
ral net model. Last, we use examples to show that 
our model can assign reasonable attention scores 
to different sentences in the essay. 


In the following sections, we first present re- 
lated research. Then we describe our tasks by in- 
troducing the ASAP corpus and the RTA corpus. 
Next, we explain the structure of our co-attention 
based neural network model. Finally, we discuss 
the results of our experiments and future plans. 


2 Related Work


Previous research in AES needed feature engineer- 
ing. In very early work, Page (1968) developed an 
AES tool named Project Essay Grade (PEG) by 
only using linguistic surface features. A more re- 
cent well-known AES system is E-Rater (Burstein 
et al., 1998), which employs many more natural 
language processing (NLP) technologies. Later, 
Attali and Burstein (2004) released E-Rater V2, 
where they created a new set of features to represent
linguistic characteristics related to organization
and development, lexical complexity, prompt-
specific vocabulary usage, etc. Similarly to Page 
(1968), this system used regression equations for 
assessment of student essays. One limitation of all 
of the above models is that all need handcrafted 
features for training the model. In contrast, our 
model uses a neural network for the AES task and 
thus does not require feature engineering. 

Recently, neural network models have been in- 
troduced into AES, making the development of 
handcrafted features unnecessary or at least op- 
tional. Alikaniotis et al. (2016) and Taghipour 
and Ng (2016) presented AES models that used 
Long Short-Term Memory (LSTM) networks. In contrast,
Dong and Zhang (2016) used a Convolutional Neural
Network (CNN) model for essay scoring, applying CNN
layers first at the word level and then at the sentence
level. Later, Dong et al. (2017) replaced the mean-over-time
pooling after the convolutional layer with attention
pooling at both the word and sentence levels. However, none of
these neural network grading models consider the 
source article if it exists. In this paper, we intro- 
duce a neural network model that takes the source 
article into account by using a co-attention mech- 
anism instead of the self-attention mechanism of 
prior work. 

Our work not only focuses on essay assess- 
ment using a holistic score, but also evaluates a 
particular dimension of argument-oriented writing 
skills, namely use of Evidence. Louis and Higgins 
(2010) analyze only the content of essays by de- 
tecting off-topic essays. Ong et al. (2014) used ar- 
gumentation mining techniques to evaluate if stu- 
dents use enough evidence to support their posi- 
tions. However, these two prior studies are not 
suitable for our task because they did not measure 
the use of content or evidence from a source ar- 
ticle. With respect to source-based dimensional 
essay analysis, Rahimi et al. (2014, 2017) devel-
oped a set of rubric-based features that compared 
a student’s essay and a source article in terms of 
number of related words or paraphrases. Zhang 
and Litman (2017) improved their model by intro- 
ducing word embedding into the feature extraction 
process to extract relationships previously missed 
due to lexical errors or use of different vocabulary. 
However, in both of these studies, human effort 
was still necessary for pre-processing the source 
article, for example, by having experts manually 
create a list of important words and phrases in the 
article which the system would compare with fea- 
tures extracted from the student’s essay. In con- 
trast, our work does not need any human effort 
to analyze the source article before essay grading. 
Although Rahimi and Litman (2016) investigated
extracting example lists using an LDA model (Blei et al.,
2003), their data-driven model missed an example
when no essay mentioned it. Klebanov et al. (2014) predicted which
parts of the source material were important and 
that students needed to use in their essays. The 
essay score is required to obtain the content impor- 
tance for their work, but our work does not need to 
know the essay score while identifying the content 
importance. 


3 Data 


We use two different essay corpora in our exper- 
iments: source-based essays from the ASAP cor- 
pus, and source-based RTA essays. While the full 
ASAP corpus contains essays in response to 8 dif- 
ferent prompts, we use only essays in response to 
the 4 source-dependent prompts. The gold stan- 
dard ASAP assessment is a holistic score. In con- 
trast, the gold standard assessment in the RTA cor- 
pus is an Evidence score. In particular, the assess- 
ment only considers how students use evidence 
from a source article to support their claims; the 
assessment thus ignores the lexical and syntactic
mistakes made by students and the organization of 
the essay when assessing the evidence dimension. 


3.1 ASAP 


The Automated Student Assessment Prize (ASAP) 
corpus consists of written responses to 8 prompts. 
Among them, prompts 3, 4, 5, and 6 are source- 
dependent which means students read an article 
before writing their essays. Since the scores assigned
to essays are holistic, assessment considers
the overall quality of the essay, not just a specific
dimension. Figure 1 contains an excerpt from an
ASAP source article and the associated Prompt 5.


Source Excerpt: My mother and father
had come to this country with such
courage, without any knowledge of the
language or the culture. They came selflessly,
as many immigrants do, to give
their children a better life, even though
it meant leaving behind their families,
friends, and careers in the country they
loved.


Essay Prompt: Describe the mood created
by the author in the memoir. Support
your answer with relevant and specific
information from the memoir.


Figure 1: A source excerpt for ASAP Prompt 5.


Prompt  | 3         | 4         | 5         | 6
Score 0 | 39 (2%)   | 311 (18%) | 24 (1%)   | 44 (3%)
Score 1 | 607 (35%) | 636 (36%) | 302 (17%) | 167 (9%)
Score 2 | 657 (38%) | 570 (32%) | 649 (36%) | 405 (23%)
Score 3 | 423 (25%) | 253 (14%) | 572 (32%) | 817 (45%)
Score 4 | NA        | NA        | 258 (14%) | 367 (20%)
Total   | 1726      | 1770      | 1805      | 1800


Table 1: The holistic score distribution of ASAP. 


In this paper, we only focus on prompts 3, 4, 5,
and 6 (denoted by ASAP3, ASAP4, ASAP5, and
ASAP6, respectively), because they are source-dependent
responses. In ASAP, different prompts
have different score ranges. The score range of
ASAP3 and ASAP4 is 0 to 3, while the range of
ASAP5 and ASAP6 is 0 to 4. Figure 2 shows an
excerpt of an essay with score of 4 for ASAP5.
The score distribution is shown in Table 1.


Essay Excerpt: The author of the memoir,
Narciso Rodriguez creates a caring,
happy, and thoughtful mood. By mentioning
the Cuban traditions shared in
the neighborhood between close friends,
and cooking in the kitchen to share a
great meal with one another the mood
is happy. When Narciso talks about the
great friends he made from different heritages
and knowing the entire community
like family the mood is thoughtful
and caring because it shows that the people
really appreciated each other’s company...


Figure 2: Excerpt of an essay with score of 4 for ASAP
Prompt 5.


3.2 RTA


The RTA corpora were collected from upper elementary
level students, as described by Correnti
et al. (2013). There are two forms of RTA based
on different articles that students read before writing
essays. The first article is from Time for Kids
about the Millennium Villages Project, an effort by
the United Nations to end poverty in a rural village
in Sauri, Kenya; we refer to it as RTA_MVP. The
other article talks about the importance of space
exploration; we refer to it as RTA_Space. Figure 3
shows an excerpt from the RTA_MVP article
and the associated essay writing prompt. Bolded
text spans in the article excerpt are pieces of evidence
that our experts (School of Education RTA
team members) manually labeled as being important
for students to include in their essays.


Source Excerpt: Today, Yala Sub- 
District Hospital has medicine, free of 
charge, for all of the most common 
diseases. Water is connected to the 
hospital, which also has a generator 
for electricity. Bed nets are used in ev- 
ery sleeping site in Sauri... 


Essay Prompt: The author provided 
one specific example of how the quality 
of life can be improved by the Millen- 
nium Villages Project in Sauri, Kenya. 
Based on the article, did the author pro- 
vide a convincing argument that win- 
ning the fight against poverty is achiev- 
able in our lifetime? Explain why or 
why not with 3-4 examples from the text 
to support your answer. 


Figure 3: A source excerpt for the RTA_MVP prompt.


Evidence usage in each RTA essay was scored
on a scale of 1 to 4 (low to high). The distribution
of Evidence scores is shown in Table 2. Figure 4 
shows a student essay with a score of 3. Our ex- 
perts manually bolded all pieces of evidence found 
in this essay. 


Essay: In my opinion I think that they 
will achieve it in lifetime. During the 
years threw 2004 and 2008 they made 
progress. People didnt have the money 
to buy the stuff in 2004. The hospi- 
tal was packed with patients and they 
didnt have alot of treatment in 2004. 
In 2008 it changed the hospital had 
medicine, free of charge, and for all 
the common dieases. Water was con- 
nected to the hospital and has a gen- 
erator for electricity. Everybody has 
net in their site. The hunger crisis 
has been addressed with fertilizer and 
seeds, as well as the tools needed to 
maintain the food. The school has no 
fees and they serve lunch. To me thats 
sounds like it is going achieve it in the 
lifetime. 


Figure 4: An RTA_MVP essay with score of 3.


Prompt  | RTA_MVP    | RTA_Space
Score 1 | 852 (29%)  | 538 (26%)
Score 2 | 1197 (40%) | 789 (38%)
Score 3 | 616 (21%)  | 512 (25%)
Score 4 | 305 (10%)  | 237 (11%)
Total   | 2970       | 2076


Table 2: The Evidence score distribution of RTA. 


4 Model 


Our network is inspired by the hierarchical neural 
network model presented by Dong et al. (2017). 
In their model, they considered each essay as a 
sequence of sentences rather than a sequence of 
words. Their model has three parts. First, they 
used a convolutional layer and attention pooling 
layer to get sentence representation. Second, they 
used an LSTM layer and another attention pooling 
layer for document representation. Finally, they
used a sigmoid layer for score prediction. 

Unlike their model, our model replaces
the attention pooling layer for document
representation with a bi-directional attention flow 
layer and an additional modeling layer (Seo et al., 
2017). By doing so, our model considers students’ 
essays associated with a source article and this at- 
tention mechanism captures the relationship be- 
tween the essay and the source article. In partic- 
ular, a higher attention score will be assigned to 
sentences that are mentioned in the article but less 
mentioned in other essays. Our model is a hierar- 
chical neural network and consists of seven layers. 
Figure 5 shows the structure of our network. The 
layers in the dashed box were presented by Dong 
et al. (2017). The sentence level co-attention layer 
was presented by Seo et al. (2017). 


4.1 Word Embedding Layer 


This layer maps each word in sentences to a high-dimensional
vector. We use the GloVe pre-trained
word embeddings (Pennington et al., 2014) to obtain
the word embedding vector for each word.
The embeddings were trained on 6 billion words from
Wikipedia 2014 and Gigaword 5 and have 400,000 uncased
vocabulary items. The dimensionality of GloVe in
our model is 50. The outputs of this
layer are two matrices, $L_E \in \mathbb{R}^{S_e \times W_e \times d_E}$ for the
essay and $L_A \in \mathbb{R}^{S_a \times W_a \times d_E}$ for the article, where
$S_e$, $S_a$, $W_e$, $W_a$, and $d_E$ are the number of sentences
of the essay and the article, the length of sentences of
the essay and the article, and the embedding size,
respectively. As in Dong et al. (2017), dropout
is applied after the word embedding layer.
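As a rough illustration of the shape described above, the following sketch maps a tokenized essay to an S x W x d nested list. It is not the authors' code: the function name `embed_essay`, the `<unk>` key, zero-vector padding, and lowercasing are all assumptions made for the example.

```python
def embed_essay(sentences, table, d_E, unk="<unk>"):
    """Map tokenized sentences to an S x W x d_E nested list.

    sentences: list of token lists (one list per sentence)
    table:     dict from token to a list of d_E floats (e.g. loaded GloVe rows)
    unk:       hypothetical key used for out-of-vocabulary tokens
    """
    W = max(len(s) for s in sentences)  # pad every sentence to the longest one
    zero = [0.0] * d_E                  # padding vector (an assumption)
    unk_vec = table.get(unk, zero)
    out = []
    for sent in sentences:
        # Lowercase because the GloVe vocabulary used here is uncased.
        rows = [list(table.get(tok.lower(), unk_vec)) for tok in sent]
        rows += [list(zero) for _ in range(W - len(sent))]
        out.append(rows)
    return out
```

The same function would be applied to the source article to obtain its own sentence-by-word-by-dimension tensor.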


4.2 Word Level Convolutional Layer 


In this layer, we perform 1D convolution over the
word representations of both $L_E$ and $L_A$, so that
we can get a local representation of each sentence.
For each word $w_i$ in each sentence, we perform 1D
convolution:


$$p_i = g([w_i : w_{i+k-1}] \cdot U_p + b_p) \quad (1)$$


where $g$ is a nonlinear activation, $k$ is the kernel
size, $U_p$ is the filter weight matrix, and $b_p$ is the
bias vector. The outputs of this layer are
$C_e \in \mathbb{R}^{S_e \times P_e \times d_C}$ for the essay and
$C_a \in \mathbb{R}^{S_a \times P_a \times d_C}$ for the article,
where $P_e$ and $P_a$ are the filtered lengths
of sentences of the essay and the article, respectively,
and $d_C$ is the number of filters of the 1D convolution
layer.
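Equation 1 can be sketched for a single sentence as follows. This is an illustrative sketch only: the choice of tanh for $g$, the absence of padding (so the filtered length is $W - k + 1$), and the flattened layout of `U_p` are assumptions, not details taken from the paper.

```python
import math

def conv1d_sentence(words, U_p, b_p, k):
    """Apply p_i = g([w_i : w_{i+k-1}] . U_p + b_p) over one sentence.

    words: list of word vectors, each of length d_E
    U_p:   filter weights with (k * d_E) rows and d_C columns
    b_p:   bias vector of length d_C
    """
    d_C = len(b_p)
    outputs = []
    for i in range(len(words) - k + 1):
        # Concatenate the k word vectors in the current window.
        window = [x for w in words[i:i + k] for x in w]
        p_i = []
        for c in range(d_C):
            z = sum(window[r] * U_p[r][c] for r in range(len(window))) + b_p[c]
            p_i.append(math.tanh(z))  # g = tanh is an assumption
        outputs.append(p_i)
    return outputs
```

With a sentence of 4 words and kernel size 2, the function returns 3 window representations, each with one value per filter.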


[Figure 5 diagram. Layers from top to bottom: Score, Output Layer, Modeling Layer, Sentence Level Co-Attention Layer (Essay2Article and Article2Essay attention), Sentence Level LSTM Layer, Word Level Attention Pooling Layer, Word Level Convolutional Layer, Word Embedding Layer. Input label shown: Source Article.]


Figure 5: The Co-Attention Based Neural Network Structure.


4.3 Word Level Attention Pooling Layer


After the convolutional layer, a pooling layer is needed
to obtain the sentence representations. In
this layer, we follow the same design presented by
Dong et al. (2017). The attention pooling is defined
by the equations below:


$$m_i = \tanh(U_m \cdot p_i + b_m) \quad (2)$$

$$v_i = \frac{e^{U_v \cdot m_i}}{\sum_j e^{U_v \cdot m_j}} \quad (3)$$

$$s = \sum_i v_i p_i \quad (4)$$


where $U_m$, $U_v$, and $b_m$ are a weight matrix, a weight vector,
and a bias vector, respectively. $m_i$ and $v_i$ are the attention
vector and attention weight for $p_i$. The outputs
of this layer are $A_e \in \mathbb{R}^{S_e \times d_C}$ for the essay
and $A_a \in \mathbb{R}^{S_a \times d_C}$ for the article.
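The attention pooling in Equations 2 to 4 can be sketched in plain Python as below. This is a minimal sketch under stated assumptions: `attention_pool` is a hypothetical helper, `U_m` is taken to be a square matrix stored as a list of rows, and the softmax subtracts the maximum only for numerical stability.

```python
import math

def attention_pool(P, U_m, b_m, U_v):
    """Pool window vectors P (n x d_C) into one sentence vector s.

    Returns (s, v) where v holds the attention weights of Equation 3.
    """
    # Equation 2: m_i = tanh(U_m . p_i + b_m)
    M = [[math.tanh(sum(u * x for u, x in zip(row, p)) + b)
          for row, b in zip(U_m, b_m)] for p in P]
    # Equation 3: softmax over the scores U_v . m_i
    scores = [sum(u * x for u, x in zip(U_v, m)) for m in M]
    top = max(scores)
    exps = [math.exp(sc - top) for sc in scores]
    total = sum(exps)
    v = [e / total for e in exps]
    # Equation 4: s = sum_i v_i * p_i
    s = [sum(v[i] * P[i][c] for i in range(len(P))) for c in range(len(P[0]))]
    return s, v
```

When `U_v` is all zeros, every window receives the same weight and the pooling reduces to mean-over-time pooling, which is the operation that attention pooling replaces.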


4.4 Sentence Level LSTM Layer 


In this layer, we use a Long Short-Term Memory 
Network (LSTM) (Hochreiter and Schmidhuber, 
1997) over the sentence representations of the es- 
say and the article to capture contextual evidence 
from previous sentences to refine the sentence rep- 
resentation. 

The LSTM unit is a special kind of RNN unit
which has long-term dependency learning ability.
LSTMs use three gates to control information
flow to avoid the long-term dependency problem
by forgetting or remembering information in each
LSTM unit. They are an input gate, a forget gate,
and an output gate. The following equations define
the LSTM unit:


$$f_t = \sigma(W_f \cdot [h_{t-1}, s_t] + b_f) \quad (5)$$
$$i_t = \sigma(W_i \cdot [h_{t-1}, s_t] + b_i) \quad (6)$$
$$\tilde{c}_t = \tanh(W_c \cdot [h_{t-1}, s_t] + b_c) \quad (7)$$
$$c_t = f_t * c_{t-1} + i_t * \tilde{c}_t \quad (8)$$
$$o_t = \sigma(W_o \cdot [h_{t-1}, s_t] + b_o) \quad (9)$$
$$h_t = o_t * \tanh(c_t) \quad (10)$$


where $s_t$ and $h_t$ are the input sentence and the output
state at time $t$, respectively. $W_f$, $W_i$, $W_c$,
and $W_o$ are weight matrices. $b_f$, $b_i$, $b_c$, and $b_o$
are bias vectors. $\sigma$ is the sigmoid function, and
$*$ is element-wise multiplication. The outputs of
this layer are $H_e \in \mathbb{R}^{S_e \times d_H}$ for the essay and
$H_a \in \mathbb{R}^{S_a \times d_H}$ for the article, where $d_H$ is the
dimensionality of the output.
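A single LSTM step following Equations 5 to 10 might look like the sketch below. The `params` dictionary layout and the helper names are hypothetical, chosen only to keep the example self-contained.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(W, x, b):
    """Return W . x + b for a matrix W stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b_r for row, b_r in zip(W, b)]

def lstm_step(s_t, h_prev, c_prev, params):
    """One LSTM step; params maps "f", "i", "c", "o" to (W, b) pairs.

    Each W has d_H rows and d_H + len(s_t) columns (an assumed layout).
    """
    x = h_prev + s_t  # the concatenation [h_{t-1}, s_t]
    f_t = [sigmoid(z) for z in affine(params["f"][0], x, params["f"][1])]
    i_t = [sigmoid(z) for z in affine(params["i"][0], x, params["i"][1])]
    c_tilde = [math.tanh(z) for z in affine(params["c"][0], x, params["c"][1])]
    # Equation 8: keep part of the old cell state, add part of the candidate.
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]
    o_t = [sigmoid(z) for z in affine(params["o"][0], x, params["o"][1])]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t
```

Running this step over the sentence vectors in order, and collecting each `h_t`, would give one row of the layer output per sentence.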


4.5 Sentence Level Co-Attention Layer 


The concept of this layer was presented by Seo et al.
(2017) as the attention flow layer. This
layer links information from $H_e$ and $H_a$, and generates
a collection of article-aware feature vectors
of essay sentences. The attention is computed
in two directions, from essay to article, and vice
versa. Both attention scores are computed from a
similarity matrix given by the following equation:


$$Sim_{ij} = W_{sim}^{\top} [h_{e_i}; h_{a_j}; h_{e_i} * h_{a_j}] + b_{sim} \quad (11)$$


where $W_{sim}$ is a weight matrix, $h_{e_i}$ and $h_{a_j}$ are the $i$-th
row vector of $H_e$ and the $j$-th row vector of $H_a$, $b_{sim}$ is a
bias vector, $*$ is element-wise multiplication, and $[;]$
is vector concatenation. After obtaining the similarity
matrix $Sim \in \mathbb{R}^{S_e \times S_a}$, we compute the
attention in two directions.
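The similarity matrix of Equation 11 can be sketched as follows. Treating the weights as a single vector of length $3 d_H$ and the bias as a scalar is an assumption about shapes made for this example, and `similarity_matrix` is a hypothetical name.

```python
def similarity_matrix(H_e, H_a, w_sim, b_sim):
    """Sim[i][j] = w_sim . [h_e_i; h_a_j; h_e_i * h_a_j] + b_sim.

    H_e: S_e x d_H essay sentence states
    H_a: S_a x d_H article sentence states
    """
    sim = []
    for h_e in H_e:
        row = []
        for h_a in H_a:
            # Concatenate both vectors and their element-wise product.
            feats = h_e + h_a + [x * y for x, y in zip(h_e, h_a)]
            row.append(sum(w * f for w, f in zip(w_sim, feats)) + b_sim)
        sim.append(row)
    return sim
```

The result has one row per essay sentence and one column per article sentence.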

Essay to Article Attention measures which
sentences in the article are similar to each sentence
in students’ essays. The following equations define
the essay to article attention:


$$\alpha_{ea} = \mathrm{softmax}(Sim) \quad (12)$$
$$\tilde{H}_a = \alpha_{ea} H_a \quad (13)$$


where $\alpha_{ea} \in \mathbb{R}^{S_e \times S_a}$ represents the attention
score of each sentence in the article associated with
each sentence in the essay, and softmax is performed
across each row. The output of this step is
$\tilde{H}_a \in \mathbb{R}^{S_e \times d_H}$.

Article to Essay Attention measures which
sentences in the essay have the closest meaning to
one of the sentences in the article. The following
equations define the article to essay attention:


$$\alpha_{ae} = \mathrm{softmax}(\max_{col}(Sim)) \quad (14)$$
$$\tilde{h}_e = \alpha_{ae}^{\top} H_e \quad (15)$$


where $\alpha_{ae} \in \mathbb{R}^{S_e}$, $\max_{col}$ is a maximum function
performed across the column, and $\tilde{h}_e \in \mathbb{R}^{d_H}$.
Because $\max_{col}$ finds which sentence in the
article has the closest meaning to each sentence
in the essay, $\tilde{h}_e$ represents the attention score
of the most important sentence in the essay associated
with the article. After tiling $S_e$ times, the
final output of this step is $\tilde{H}_e \in \mathbb{R}^{S_e \times d_H}$.


The final output $G$ is a concatenated matrix of
$H_e$, $\tilde{H}_a$, and $\tilde{H}_e$ defined by:


$$G = [H_e; \tilde{H}_a; H_e * \tilde{H}_a; H_e * \tilde{H}_e] \quad (16)$$


where $*$ is element-wise multiplication, $[;]$ is
concatenation, $H_e$ is the original representation of the
essay, $\tilde{H}_a$ is the essay to article attention, $H_e * \tilde{H}_a$
is the self-aware representation, and $H_e * \tilde{H}_e$ is the
article-aware representation. Therefore, the output
of this layer is $G \in \mathbb{R}^{S_e \times 4 d_H}$, the article-aware
representation of each sentence in the essay.
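Given a similarity matrix, the two attention directions and the concatenation in Equation 16 can be sketched as below. This is one reading of the equations rather than the authors' implementation: in particular, the maximum in Equation 14 is taken over article sentences for each essay sentence, and `co_attention` is a hypothetical name.

```python
import math

def softmax(xs):
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def co_attention(H_e, H_a, sim):
    """Return G (S_e x 4*d_H) plus both attention weight sets."""
    d_H = len(H_e[0])
    # Essay to article (Eq. 12-13): row-wise softmax, then a weighted
    # sum of article sentence vectors for every essay sentence.
    alpha_ea = [softmax(row) for row in sim]
    H_a_tilde = [[sum(a * h[c] for a, h in zip(alpha, H_a)) for c in range(d_H)]
                 for alpha in alpha_ea]
    # Article to essay (Eq. 14-15): best article match per essay sentence,
    # softmax over essay sentences, then one weighted sum of essay vectors.
    alpha_ae = softmax([max(row) for row in sim])
    h_e_tilde = [sum(a * h[c] for a, h in zip(alpha_ae, H_e)) for c in range(d_H)]
    # Eq. 16: h_e_tilde is reused for every row, which is the tiling step.
    G = []
    for h_e, h_a_t in zip(H_e, H_a_tilde):
        G.append(h_e + h_a_t
                 + [x * y for x, y in zip(h_e, h_a_t)]
                 + [x * y for x, y in zip(h_e, h_e_tilde)])
    return G, alpha_ea, alpha_ae
```

The weights in `alpha_ae` hold one value per essay sentence, so they could be inspected as sentence-level attention scores.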


4.6 Modeling Layer 


$G$ is the representation of each sentence, and we
need the representation of the essay. Therefore, we
introduce another LSTM layer for modeling the
essay and only use the output of the final LSTM
unit as the output of this layer, $M \in \mathbb{R}^{d_M}$, where
$d_M$ is the dimensionality of the output of LSTM
units.


4.7 Output Layer


After obtaining the essay representation $M$, a linear
layer with sigmoid activation predicts the
final output. The following equation defines the
output layer:


$$y = \mathrm{sigmoid}(W_y M + b_y) \quad (17)$$


where $W_y$ is the weight vector, and $b_y$ is the bias vector. $y$
is the final predicted score of the essay.


5 Training 


Loss. Following Dong et al. (2017), we use mean
squared error (MSE) loss. MSE is the average squared
error between the predicted scores and the gold standard
scores, and it is widely used in regression tasks. The
following equation defines MSE:


$$mse(y, y') = \frac{1}{N} \sum_{i=1}^{N} (y_i - y'_i)^2 \quad (18)$$


where $y_i$ is the predicted score, $y'_i$ is the gold standard
score, and $N$ is the total number of samples.

Optimization. The optimizer we use is RMSprop
(Dauphin et al., 2015). The initial learning
rate is 0.001, momentum is 0.9, and the dropout rate
is 0.5 to prevent overfitting. These settings are
the same as those used by Dong et al. (2017).
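The MSE loss of Equation 18 is small enough to write out directly. The sketch below is illustrative only, and the function name is an assumption.

```python
def mse(y_pred, y_gold):
    """Mean squared error between predicted and gold scores (Equation 18)."""
    if len(y_pred) != len(y_gold) or not y_pred:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - g) ** 2 for p, g in zip(y_pred, y_gold)) / len(y_pred)
```

Because the model's sigmoid output lies in [0, 1], the gold scores passed to such a loss would be the scaled scores rather than the original rubric values.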


6 Experimental Setup 


We configure experiments to test three hypotheses: 


H1: the model we proposed (denoted by CO- 
ATTN) will outperform or at least per- 
form equally well as the baseline (denoted 
by SELF-ATTN) presented by Dong et al. 
(2017) on four ASAP essay corpora in the 
holistic score prediction task. 


H2: the model we proposed will outperform or at 
least perform equally well as the baseline on 
two RTA corpora in the Evidence score pre- 
diction task. 


H3: the model we proposed will outperform or at 
least perform equally well as the non-neural 
network baselines on both corpora. 


We use NLTK (Bird et al., 2009) for text preprocessing.
The vocabulary size of the data is limited
to 4000, and all scores are scaled to the range [0,
1], following Taghipour and Ng (2016) and Dong
et al. (2017). In particular, the 4000 most frequent
words are preserved, with all other words
treated as unknowns. The assessment scores are
converted back to their original range during
evaluation. We use Quadratic Weighted Kappa
(QWK) to evaluate our model. QWK is not
only the official criterion of the ASAP corpus, but
is also adopted as the evaluation metric by Rahimi et al.
(2014), Taghipour and Ng (2016), Dong et al.
(2017), Rahimi et al. (2017), and Zhang and Litman
(2017) for both ASAP and RTA corpora.
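The score scaling and the QWK metric described above can be sketched as follows. This uses the standard definition of quadratic weighted kappa; the function names and the rounding used when converting predictions back to the original range are assumptions.

```python
def scale(score, lo, hi):
    """Map an integer score in [lo, hi] to [0, 1]."""
    return (score - lo) / (hi - lo)

def unscale(y, lo, hi):
    """Map a prediction in [0, 1] back to the nearest integer score."""
    return int(round(y * (hi - lo) + lo))

def quadratic_weighted_kappa(gold, pred, lo, hi):
    """QWK for integer ratings in [lo, hi] (requires hi > lo).

    Raises ZeroDivisionError in the degenerate case where the expected
    disagreement is zero (e.g. both raters use one identical score).
    """
    n_cat = hi - lo + 1
    n = len(gold)
    observed = [[0] * n_cat for _ in range(n_cat)]
    for g, p in zip(gold, pred):
        observed[g - lo][p - lo] += 1
    hist_g = [sum(row) for row in observed]
    hist_p = [sum(observed[i][j] for i in range(n_cat)) for j in range(n_cat)]
    num = den = 0.0
    for i in range(n_cat):
        for j in range(n_cat):
            w = (i - j) ** 2 / (n_cat - 1) ** 2  # quadratic disagreement weight
            num += w * observed[i][j]
            den += w * hist_g[i] * hist_p[j] / n  # expected count under chance
    return 1.0 - num / den
```

A value of 1 means perfect agreement with the gold scores, and a value of 0 means agreement no better than chance.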

We use 5-fold cross-validation because both 
RTA and ASAP corpora have no released labeled 
test data. We split all corpora into 5 folds. For the 
ASAP corpus, the partition is the same as the set- 
ting presented by Taghipour and Ng (2016). For 
the RTA corpus, since there is no prior work to 
split the corpus, we separate it into 5 folds ran- 
domly. In each fold, 60% of the data are used for 
training, 20% of the data are the development set, 
and 20% of the data are used for testing. 
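One way to produce a 60/20/20 split in each of the 5 folds is sketched below. The rule for picking the development fold (the fold after the test fold) and the fixed seed are assumptions made for illustration; the paper does not state how the development fold is chosen for the RTA corpus.

```python
import random

def five_fold_splits(ids, seed=0):
    """Return 5 (train, dev, test) triples with 60/20/20 proportions."""
    ids = list(ids)
    random.Random(seed).shuffle(ids)          # random partition into 5 folds
    folds = [ids[k::5] for k in range(5)]
    splits = []
    for k in range(5):
        dev_k = (k + 1) % 5                   # assumed choice of dev fold
        test = folds[k]
        dev = folds[dev_k]
        train = [x for j, fold in enumerate(folds)
                 if j not in (k, dev_k) for x in fold]
        splits.append((train, dev, test))
    return splits
```

Each essay appears in exactly one test set across the five splits, so the averaged test QWK covers the whole corpus.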

To select the best model, we trained each model
for 100 epochs and evaluated on the development
set after each epoch. The best model is the model 
with the best QWK on the development set. This 
is done five times, once for each partition in the 
cross-validation. Then the average QWK score 
from these five evaluations on the test set is re- 
ported. Paired t-tests are used for significance 
tests with p < 0.05. Table 3 shows all hyper- 
parameters for training. 
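The selection and significance-testing procedure can be sketched with two small helpers. Both names are hypothetical, and comparing the statistic against 2.776 assumes a two-sided test with 4 degrees of freedom (five folds); the paper does not say whether its test is one-sided or two-sided.

```python
import math

# Two-sided critical value of Student's t at p = 0.05 with 4 degrees of freedom.
T_CRIT_DF4 = 2.776

def best_epoch(dev_qwk_per_epoch):
    """Index of the epoch with the highest development-set QWK."""
    return max(range(len(dev_qwk_per_epoch)), key=lambda e: dev_qwk_per_epoch[e])

def paired_t_statistic(a, b):
    """Paired t statistic for per-fold scores a and b of two models.

    Raises ZeroDivisionError if all paired differences are identical.
    """
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)
```

With five folds, an absolute statistic above `T_CRIT_DF4` would correspond to p < 0.05 under the two-sided assumption.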

The code of SELF-ATTN was provided by Dong
et al. (2017), who used Keras (Chollet et al., 2015)
1.1.1 with Theano (Theano Development Team,
2016) 0.8.2 as the backend. Because we use
Keras 2.1.3 with TensorFlow (Abadi et al., 2015)
1.4.0 as the backend, we ran all experiments with
our frameworks. Therefore, our SELF-ATTN results
differ slightly from the numbers reported
by Dong et al. (2017).

For non-neural network baselines, we use
the SVR and BLRR baselines presented by Phandi
et al. (2015) for the ASAP corpus, and the SG baseline
presented by Zhang and Litman (2017) for the
RTA corpus.

The SVR and BLRR models use the Enhanced AI Scoring
Engine (EASE, https://github.com/edx/ease) to extract
four types of features: length, part of speech, prompt,
and bag of words. They then use SVR and BLRR
as the regression models, respectively. We do not perform
any significance test on either SVR or BLRR because
we do not have detailed experiment data.
Therefore, we only report the results presented in
Phandi et al. (2015).

The SG model extracts evidence features based on
hand-crafted topic and example lists, and uses a random
forest as the classifier. We follow the
same data partition. However, we only use the 
training set for training and the testing set for test- 
ing while ignoring the development set so that we 
can perform the same paired t-tests in the experi- 
ments. 


Layer     | Parameter Name        | Value
Embedding | Embedding dimension   | 50
Word-CNN  | Kernel size           | 5
          | Number of filters     | 100
Sent-LSTM | Hidden units          | 100
Modeling  | Hidden units          | 100
Dropout   | Dropout rate          | 0.5
Others    | Epochs                | 100
          | Batch size            | 100
          | Initial learning rate | 0.001
          | Momentum              | 0.9


Table 3: Hyper-parameters of training. 


7 Results 


We first examine H1. The results shown in Table 4 
support this hypothesis. The CO-ATTN model 
yields higher performance than the SELF-ATTN 
model on all ASAP prompts. However, the CO- 
ATTN model only significantly outperforms the 
SELF-ATTN model on Prompt 3. 

Second, we examine H2. Again, the results 
shown in Table 4 support this hypothesis. The CO- 
ATTN model yields higher performance than the 
SELF-ATTN model, significantly on both of the 
RTA corpora. 

Last, we examine H3. The results shown in 
Table 4 still support this hypothesis. The CO- 
ATTN model yields higher performance than all 
non-neural network baselines. 

The results show that in our tasks, the neural
network approaches are better than non-neural
network baselines. One possible reason is that the final
representation of the essay from the neural network
contains more information, some of which
might be ignored by hand-crafted features.
For example, the importance of different
evidence in the RTA task is not considered in the SG
model, which treats all evidence equally. However,
the neural network models capture this information
automatically.


Prompts   | SVR   | BLRR  | SG    | SELF-ATTN | CO-ATTN
RTA_Space | NA    | NA    | 0.632 | 0.690†    | 0.702*†
ASAP3     | 0.630 | 0.621 | NA    | 0.677     | 0.697*
ASAP4     | 0.749 | 0.784 | NA    | 0.807     | 0.809
ASAP5     | 0.782 | 0.784 | NA    | 0.806     | 0.815
ASAP6     | 0.771 | 0.775 | NA    | 0.809     | 0.812


Table 4: The performance (QWK) of the baselines and our model. * indicates that the model QWK is significantly
better than the SELF-ATTN (p < 0.05). † indicates that the model QWK is significantly better than the SG
(p < 0.05). The best results in each row are in bold.


The CO-ATTN model performs better
in the RTA tasks, where it always significantly
outperforms the SELF-ATTN model. One
possible reason is that the RTA task only considers
the Evidence score. The CO-ATTN model is more
suitable for the Evidence score prediction task because
it is better at finding pieces of evidence that appear in
both students’ essays and the source article.
In contrast, the SELF-ATTN model only considers
students’ essays associated with the scores. In
this case, if a piece of evidence is not mentioned
by students, this data-driven model cannot distinguish
it. Consequently, some important pieces of
evidence will be assigned a lower weight. However,
the CO-ATTN model considers not only the
students’ essays but also the source article. In
other words, if an important piece of evidence is
mentioned by few students but appears in
the source article, the CO-ATTN model will assign
this sentence higher attention.


In the ASAP holistic score prediction task, al- 
though we still see a benefit in using the CO- 
ATTN model, it is reduced. In this case, the 
benefit we saw in the Evidence dimension from 
the CO-ATTN model becomes less significant be- 
cause the model also needs to consider more as- 
pects of the essay, such as organization, grammar 
mistakes, and so on. Our results suggest that the 
co-attention mechanism of the CO-ATTN model 
cannot capture these aspects significantly better 
than the SELF-ATTN model. Therefore, the CO- 
ATTN model only significantly outperforms the 
SELF-ATTN model on Prompt 3. 


8 Discussion


In Table 5, we list 10 sentences from student
RTA_MVP essays and their associated attention
scores. Because we have a list of examples manually
extracted by our experts as important evidence
from the RTA_MVP source article, examining
RTA data helps us understand the attention
scores assigned by our model. Bolded are examples
extracted by the expert from the source article
that the student includes in the essay. A lower
attention score means the sentence is less important,
and a higher score means it is more important.
As we can see, sentences 1, 2, 3, and 4 are low
attention sentences, sentences 5, 6, and 7 are mid
attention sentences, and sentences 8, 9, and 10 are
high attention sentences. The attention scores reflect
the importance of these sentences accurately.

Sentence 1 is a short and general sentence re- 
lated to the source article, but it has no specific 
evidence from it. Sentence 2 even has no content 
related to the source article. Sentence 3 has many 
details related to the source article. However, it 
still has no evidence directly from the source ar- 
ticle. Sentence 4 mentions “The author did con- 
vince me that winning the fight against poverty is 
achievable in our lifetime” which comes from both 
the prompt and the source article, but this state- 
ment is so general that almost every student men- 
tions this statement in the essay which makes this 
statement not distinguishable. For these reasons, 
these four sentences receive low attention scores. 

Although sentence 5 is short, it mentions one
piece of evidence. Sentence 6 talks about farming,
which is a topic from the source article. In
the article, the things listed in this sentence are
things the farmer needs to worry about. However,
this sentence indicates “the farmer don’t have to
worry” because of the MVP project. Sentence 7
also mentions conditions of hospitals nowadays.
However, it mentions not only water but also electricity,
which is more than Sentence 5. For these
reasons, these three sentences receive mid attention
scores, from low to high.

The last three sentences receive high attention 
scores because they all use more pieces of evi- 
dence directly from the source article. Sentence 8 
talks about the school, and Sentence 9 talks about 
the hospital. Sentence 10 talks about farming. 
However, sentence 10 receives the highest atten- 
tion score, because it mentions evidence from both 
before and after the MVP project. 


No. | Sentence | Attention
1 | Life in Kenya is hard. | 0.00173
2 | In this essay I will give my top 3 reasons why. | 0.00174
3 | Because like I said, we have more advanced & better & more qualified materials than them, and these days kids & adults are spoiled, we have phones stores, houses & even shoes and clothes. | 0.00243
4 | The author did convince me that winning the fight against poverty is achievable in our lifetime because she showed me how many people in Sauri, Kenya need our help against poverty. | 0.00229
5 | Water is connected to the hospitals. | 0.02936
6 | So the farmer don’t have to worry all the time that him or his family won’t have enough food to eat and the farmer have to worry that their kids will get hungry and then sick. | 0.05580
7 | The hospital aslo has water and electricity. | 0.07746
8 | Also, there were no school fees, and the school now serves lunch for the students because they didn’t have any midday meals to provide them with energy they need to help them with the rest of their days. | 0.19483
9 | In 2008 though, when they checked for progress, the hospital had medicine, free of charge, with running water and electricty. | 0.20177
10 | Also farmers could not afford fertilizer and irrigation but now they placed irrigation and have them fertilizer for the crops. | 0.25855


Table 5: Example attention scores of essay sentences. 


From these sentences, we can also see that the 
attention score depends on neither the length of 
the sentence nor only the specificity of the sen- 
tence. It instead depends on how many impor- 
tant pieces of evidence there are in the sentence. 
For example, Sentence 3 is long and talks about 
some details of our modern life. Although it also 
talks about quality materials or better housing and
clothing compared to people living in Kenya, it re- 
ceives a low attention score because there is no 
specific evidence directly from the source article. 
In contrast, Sentence 9 is shorter than Sentence 3. 
However, it receives a higher attention score be- 
cause it mentions many pieces of evidence from 
the source article. 

Overall, the CO-ATTN model seems to capture 
the importance of sentences by assigning reason- 
able attention scores based on the relevance of the 
sentence to the source article. 


9 Conclusion and Future Work 


In this paper, we presented a co-attention based 
neural network model that outperforms a state of 
the art attention based neural network model for 
essay scoring, not only for RTA Evidence assess- 
ment but also for holistic assessment of ASAP 
source-dependent responses. Advantages of our 
model are that it does not need any expert pre- 
processing of the source article; the input of this 
model is only the raw student essay and its source 
article. Moreover, our model somewhat captures 
the importance of different pieces of evidence, al- 
though it is not specifically designed for this pur- 
pose. However, quantitative experiments that can 
answer whether the attention scores are correlated 
to the importance of different pieces of evidence 
need to be done. Also, this leads to an interesting 
future investigation, development of a neural net- 
work approach that both has an acceptable score 
prediction, and can simultaneously generate evi- 
dence lists from the source article. Another inter- 
esting future investigation could be examining the 
ability of this model to generalize to a new prompt. 